Task 3


Binary classification

Kaggle

Dataset description

Context

This dataset was collected from the UCI Machine Learning Repository via the following link: click

Data Set Information:

The data used in this study were gathered from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 (65.1±10.9) at the Department of Neurology in the Cerrahpaşa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82 (61.1±8.9). During the data collection process, the microphone was set to 44.1 kHz and, following the physician's examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.

Attribute Information:

Various speech signal processing algorithms, including Time Frequency Features, Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TQWT features, have been applied to the speech recordings of Parkinson's Disease (PD) patients to extract clinically useful information for PD assessment. Related paper

Attribute description:

Similar paper

Import required libraries

EDA

There are 756 rows and quite a lot of columns: 755. Each subject has 3 recordings, so there are 252 subjects overall (188 PD patients and 64 healthy controls).

From the dataset description, attributes are extracted using:

Various speech signal processing algorithms, including Time Frequency Features, Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TQWT features, have been applied to the speech recordings of Parkinson's Disease (PD) patients to extract clinically useful information for PD assessment.

Without diving into the domain, I cannot extract better features than the authors of the related paper.

The target is imbalanced, as in most medical data, but this time class 0 (no Parkinson's Disease) is the underrepresented one

Gender is likely an important feature, because male and female vocal characteristics can differ a lot.
So let's look at the gender proportions in each class

We have:

Males are underrepresented in the No PD group

Females are underrepresented in the PD group

Splitting the data

Correlations in the dataset:

We know there are correlated features in this dataset, so models that are not robust to multicollinearity might suffer

Feature scaling

If we look at the feature distributions, we see some that are roughly normal with mild skew and others that are heavily skewed

These are the first 20 features, but I checked features of each attribute type (see the attribute description) and the distributions are quite similar

I will use QuantileTransformer for feature scaling. This method transforms the features to follow a uniform or a normal distribution. The transformation tends to spread out the most frequent values and also reduces the impact of outliers

The first two features are id and gender; we don't need to transform them
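A minimal sketch of this scaling step on synthetic data (the column layout, with id and gender as the first two columns, mirrors the dataset; the skewed feature values themselves are made up):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
n = 756
# Synthetic stand-in: column 0 plays the role of id, column 1 gender,
# the remaining columns are heavily right-skewed measurements.
X = np.column_stack([
    np.repeat(np.arange(n // 3), 3),        # id: 3 recordings per subject
    rng.integers(0, 2, size=n),             # gender
    rng.lognormal(0.0, 2.0, size=(n, 5)),   # skewed features
])

# Transform only the measurement columns; id and gender stay untouched.
# n_quantiles is capped at the sample count to avoid a warning.
qt = QuantileTransformer(output_distribution="normal",
                         n_quantiles=n, random_state=0)
X_scaled = X.copy()
X_scaled[:, 2:] = qt.fit_transform(X[:, 2:])

print(X_scaled[:, 2:].mean(axis=0).round(2))  # roughly 0 per feature
print(X_scaled[:, 2:].std(axis=0).round(2))   # roughly 1 per feature
```

After the transform each skewed feature follows an approximately standard normal distribution, regardless of its original shape.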

Pairplot of features after scaling:

And the correlations have become stronger after scaling (the colors are more saturated)

Cross-validation scheme

Cross-validation on our dataset requires stratifying by class and also grouping by id, so that a subject's three recordings never span the train/test split

Now let's check the scores for different models out of the box
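A hedged sketch of such an out-of-the-box comparison on synthetic data; only the scikit-learn models are included here to keep the example dependency-free (the boosted models below would slot into the same loop):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the speech data.
X, y = make_classification(n_samples=300, weights=[0.25, 0.75],
                           random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "LogReg": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    res = cross_validate(model, X, y, cv=5,
                         scoring=("accuracy", "precision", "recall"))
    print(name,
          round(res["test_accuracy"].mean(), 3),
          round(res["test_precision"].mean(), 3),
          round(res["test_recall"].mean(), 3))
```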

KNN

LogReg

DT

RF

CatBoost

LightGBM

XGBoost

Models comparison:

LGBMClassifier gives the best recall/precision balance and the best accuracy

Upsampling with SMOTE

Now let's try cross-validating with SMOTE upsampling

As expected, recall decreased and precision increased, which is not really what we want

Resampling

This is how the resampling method works:
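One way such gender-balancing resampling can be sketched on toy data (upsampling each gender within each class to equal counts; this is an assumption about the exact procedure, not necessarily the author's implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame mimicking the imbalance: males underrepresented in class 0.
df = pd.DataFrame({
    "gender": rng.choice([0, 1], size=300, p=[0.3, 0.7]),
    "class": rng.choice([0, 1], size=300, p=[0.25, 0.75]),
})

# Within each class, upsample each gender to the larger gender's count,
# so the gender proportion inside every class becomes 50/50.
parts = []
for cls, grp in df.groupby("class"):
    n = grp["gender"].value_counts().max()
    for _, sub in grp.groupby("gender"):
        parts.append(sub.sample(n=n, replace=True, random_state=0))
resampled = pd.concat(parts, ignore_index=True)

print(resampled.groupby("class")["gender"].mean())  # 0.5 in each class
```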

Now let's cross-validate on the resampled data, where the gender proportions in each class are equal

It seems the scores are slightly lower than on the raw data.

CV scores on original data, upsampled data and resampled data compared:

Dimensionality Reduction

PCA

Let's look at how the data is distributed in 3 dimensions (using PCA)

As we see, the data is not very separable even in 3 dimensions.

Let's find the optimal number of components

I would choose 150 components: that's 5 times fewer features, but they still explain most of the variance (around 95%)
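The component count can be read off the cumulative explained-variance curve; a sketch on synthetic correlated data (the latent dimensionality and noise level here are made up, so the resulting count will differ from the real 150):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic correlated features standing in for the speech features:
# a low-dimensional latent signal spread over many noisy columns.
latent = rng.normal(size=(756, 30))
X = latent @ rng.normal(size=(30, 200)) + 0.1 * rng.normal(size=(756, 200))

# Scale first: PCA is driven by variance, so unscaled features dominate.
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cum = np.cumsum(pca.explained_variance_ratio_)
# First component count whose cumulative explained variance reaches 95%.
n_95 = int(np.searchsorted(cum, 0.95)) + 1
print(n_95)
```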

This is how the perform_pca method works on our data (just an example to validate):

Let's check how PCA affects our models. This time even the tree models are trained on scaled data, because the data must be scaled before PCA

So the results are not good: PCA negatively affects the models' scores. Some models are definitely overfitted (RFC, CatBoost)

As the final model, I would choose LGBMClassifier with resampling.

Model tuning

Since I use grouping, resampling and stratifying together, I have to write my own wrapper transformer with a fit_resample method
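A minimal sketch of what such a wrapper could look like, exposing imbalanced-learn's fit_resample interface (the class name, the gender column index, and the balancing rule are all assumptions for illustration):

```python
import numpy as np

class GenderBalancer:
    """Sketch of a resampler with imblearn's fit_resample interface:
    upsamples each gender within each class to equal counts.
    Assumes gender lives in column 1, as in this dataset."""

    def __init__(self, gender_col=1, random_state=0):
        self.gender_col = gender_col
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.default_rng(self.random_state)
        idx = []
        for cls in np.unique(y):
            cls_idx = np.flatnonzero(y == cls)
            genders = X[cls_idx, self.gender_col]
            # Upsample every gender to the size of the largest one.
            n = max(np.sum(genders == g) for g in np.unique(genders))
            for g in np.unique(genders):
                g_idx = cls_idx[genders == g]
                idx.append(rng.choice(g_idx, size=n, replace=True))
        idx = np.concatenate(idx)
        return X[idx], y[idx]
```

Because it implements fit_resample, such an object can be dropped into an imblearn Pipeline just like SMOTE.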

I will use the F1 weighted score in GridSearch, because it takes into account both recall and precision (for both classes).
I do not tune on class-1 recall alone, because then the model would just classify almost all objects as 1, and that is a bad model
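The tuning setup can be sketched as follows on synthetic data; LogisticRegression and its C grid stand in for LGBMClassifier and its parameters, and plain StratifiedKFold stands in for the grouped scheme used in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced stand-in for the speech data.
X, y = make_classification(n_samples=300, weights=[0.25, 0.75],
                           random_state=0)

# f1_weighted averages per-class F1 weighted by support, so both the
# minority and the majority class influence the chosen parameters.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1_weighted",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```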

Both precision and recall have increased

We can also see that, for example, the first and last fold scores differ a lot; I guess that happens because of the small dataset

The second round of GridSearch (now with narrower parameter ranges):

And the second round also helped a little.
Pretty good scores, I think.

Results

What has been done in this work:

So we have the following model:

LGBMClassifier on resampled data with the parameters:

And the mean CV scores of this model are: